Import the Competition Data from Kaggle API

Loading the Data and Libraries

Exploratory Data Analysis

The Goal

Uni-variate Analysis

Let's analyze the distribution of 'SalePrice', our target variable.
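A minimal sketch of this step, using a synthetic right-skewed price column in place of the actual competition data:

```python
import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis

# Hypothetical stand-in for the competition data: a log-normal price column
rng = np.random.default_rng(0)
df = pd.DataFrame({"SalePrice": np.exp(rng.normal(12, 0.4, size=1000))})

# Summary statistics of the target
print(df["SalePrice"].describe())
print("Skewness:", skew(df["SalePrice"]))
print("Kurtosis:", kurtosis(df["SalePrice"]))

# A histogram of the target (requires matplotlib/seaborn):
# import seaborn as sns
# sns.histplot(df["SalePrice"], kde=True)
```

Positive skewness here indicates a long right tail, which motivates the log transform applied later.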

Numerical variables can be a) binary, b) continuous, or c) discrete. Ideally, we would know what each variable means in order to distinguish continuous from discrete variables. In this notebook, we will assume that variables with a small, finite number of unique values are discrete.

Finding discrete variables

To identify discrete variables, I will select those numerical variables that contain a finite and small number of distinct values.
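A sketch of that selection, on a hypothetical frame (the threshold of 20 unique values is an assumption, not necessarily the notebook's exact cutoff):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for the training data
df = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=500),      # discrete: 10 levels
    "LotArea": rng.normal(10000, 2000, size=500),      # continuous
    "Street": rng.choice(["Pave", "Grvl"], size=500),  # categorical (ignored)
})

numerical = df.select_dtypes(include="number").columns
# A numerical column with few unique values is treated as discrete
discrete = [c for c in numerical if df[c].nunique() < 20]
continuous = [c for c in numerical if c not in discrete]
print(discrete)  # ['OverallQual']
```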

Pairwise Correlation

Let's plot how SalePrice relates to some of the features in the dataset.
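One way to sketch this, on synthetic data with hypothetical feature names: rank every numerical feature by its correlation with the target, then scatter-plot the strongest ones.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical features correlated with the target to different degrees
area = rng.normal(1500, 300, size=400)
df = pd.DataFrame({
    "GrLivArea": area,
    "YearBuilt": rng.integers(1900, 2010, size=400),
    "SalePrice": 100 * area + rng.normal(0, 20000, size=400),
})

# Correlation of every numerical feature with SalePrice, strongest first
corr = df.corr()["SalePrice"].sort_values(ascending=False)
print(corr)

# Scatter plot against the target (requires matplotlib):
# df.plot.scatter(x="GrLivArea", y="SalePrice")
```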

Feature Engineering

Let's take a look at the distribution of the SalePrice.

The SalePrice is skewed to the right. This is a problem because many ML models, particularly linear ones, perform poorly on non-normally distributed data. We can apply a log(1+x) transform to reduce the skew.
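The transform itself is one line with NumPy; a sketch on a synthetic right-skewed target:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Hypothetical right-skewed target, like SalePrice
price = np.exp(rng.normal(12, 0.4, size=1000))

before = skew(price)
log_price = np.log1p(price)  # log(1 + x), well-defined at x = 0
after = skew(log_price)
print(f"skew before: {before:.2f}, after: {after:.2f}")

# np.expm1 inverts the transform when converting predictions back
assert np.allclose(np.expm1(log_price), price)
```

Remember to apply `np.expm1` to the model's predictions so that submissions are on the original price scale.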

Let's plot the SalePrice again.

Fill in Missing Values

We can now move through each of the features above and impute the missing values for each of them.
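A sketch of the usual imputation patterns, on a hypothetical frame (the column names mirror the competition data, but the exact strategies per feature are an assumption):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the kinds of gaps seen in the competition data
df = pd.DataFrame({
    "PoolQC": [None, "Gd", None, "Ex"],         # NaN means "no pool"
    "LotFrontage": [65.0, np.nan, 80.0, 60.0],  # numeric, truly missing
    "Electrical": ["SBrkr", None, "SBrkr", "FuseA"],
})

# Categorical NaNs that encode absence become an explicit "None" level
df["PoolQC"] = df["PoolQC"].fillna("None")
# Numeric gaps are imputed with the median
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())
# Rare, genuinely missing categoricals get the most frequent value
df["Electrical"] = df["Electrical"].fillna(df["Electrical"].mode()[0])

print(df.isnull().sum().sum())  # 0
```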

There are no missing values anymore!

Fix Skewed Features
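A sketch of the skew correction on a synthetic feature, using scipy's Box-Cox helpers (a common choice in this competition; the threshold of 0.5 and the exact transform are assumptions, not necessarily the notebook's code):

```python
import numpy as np
import pandas as pd
from scipy.stats import skew, boxcox_normmax
from scipy.special import boxcox1p

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature
df = pd.DataFrame({"LotArea": np.exp(rng.normal(9, 0.5, size=500))})

# Transform every numeric feature whose skewness exceeds a threshold
numeric = df.select_dtypes(include="number").columns
for col in numeric:
    if abs(skew(df[col])) > 0.5:
        # Box-Cox on 1 + x, with lambda fitted per feature
        df[col] = boxcox1p(df[col], boxcox_normmax(df[col] + 1))

print("skew after:", skew(df["LotArea"]))
```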

All the features look fairly normally distributed now.

Feature Creation

Let's help our models out by creating a few features based on our intuition about the dataset, e.g. total area of floors, bathrooms and porch area of each house.
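These aggregates can be sketched as simple column sums; the toy values below are hypothetical, but the source column names are the competition's:

```python
import pandas as pd

# Hypothetical rows with the relevant competition columns
df = pd.DataFrame({
    "TotalBsmtSF": [856, 1262], "1stFlrSF": [856, 1262], "2ndFlrSF": [854, 0],
    "FullBath": [2, 2], "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0], "BsmtHalfBath": [0, 1],
    "OpenPorchSF": [61, 0], "EnclosedPorch": [0, 0],
    "3SsnPorch": [0, 0], "ScreenPorch": [0, 0], "WoodDeckSF": [0, 298],
})

# Aggregate intuitive totals into single features
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
# Half baths count for half
df["TotalBathrooms"] = (df["FullBath"] + 0.5 * df["HalfBath"]
                        + df["BsmtFullBath"] + 0.5 * df["BsmtHalfBath"])
df["TotalPorchSF"] = (df["OpenPorchSF"] + df["EnclosedPorch"]
                      + df["3SsnPorch"] + df["ScreenPorch"] + df["WoodDeckSF"])
print(df[["TotalSF", "TotalBathrooms", "TotalPorchSF"]])
```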

Feature transformations

Let's create more features by computing the log and square transformations of our numerical features. We do this manually because most ML models cannot reliably determine on their own whether log(feature) or feature^2 is a better predictor of SalePrice.
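A sketch of generating both transforms for every numeric column, on hypothetical data (the `_log`/`_sq` suffixes are an assumption):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical numeric features
df = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 300, size=100),
    "OverallQual": rng.integers(1, 11, size=100).astype(float),
})

# Add log(1+x) and squared versions of each numeric feature so that
# linear models can pick up these non-linear relationships directly
for col in list(df.columns):
    df[f"{col}_log"] = np.log1p(df[col])
    df[f"{col}_sq"] = df[col] ** 2

print(df.columns.tolist())
```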

Encode categorical features

Numerically encode the categorical features, since most models can only handle numerical inputs.
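One-hot encoding with `pd.get_dummies` is the usual approach here; a sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical mixed-type frame
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "CollgCr", "NAmes"],
    "LotArea": [8450, 9600, 11250],
})

# One-hot encode every object/categorical column; numeric columns pass through
encoded = pd.get_dummies(df)
print(encoded.columns.tolist())
# ['LotArea', 'Neighborhood_CollgCr', 'Neighborhood_NAmes']
```

Note that encoding should be done on the concatenated train and test sets so that both end up with identical columns.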

Recreate Train and Test sets

Visualize some of the features we're going to train our models on.

Train a model

Key features of the model training process:

Setup cross validation and define error metrics
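A sketch of a typical setup on synthetic data: a shuffled k-fold splitter plus an RMSE helper built on scikit-learn's negative-MSE scorer (the fold count and seed are assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# Synthetic regression problem standing in for the prepared training set
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=200)

# k-fold cross validation with shuffling
kf = KFold(n_splits=5, shuffle=True, random_state=42)

def cv_rmse(model, X, y):
    """Per-fold RMSE via the negative-MSE scorer."""
    scores = cross_val_score(model, X, y, cv=kf,
                             scoring="neg_mean_squared_error")
    return np.sqrt(-scores)

rmse = cv_rmse(Ridge(alpha=1.0), X, y)
print(f"RMSE: {rmse.mean():.4f} (+/- {rmse.std():.4f})")
```

The same helper can then be reused to score every candidate model on identical folds, which makes their cross-validation scores directly comparable.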

Setup Models

Stacking Regressors

Stacking is an ensemble learning technique to combine multiple regression models via a meta-regressor. The StackingCVRegressor extends the standard stacking algorithm (implemented as StackingRegressor) using out-of-fold predictions to prepare the input data for the level-2 regressor.

In the standard stacking procedure, the first-level regressors are fit to the same training set that is used to prepare the inputs for the second-level regressor, which may lead to overfitting. The StackingCVRegressor, however, uses the concept of out-of-fold predictions: the dataset is split into k folds, and in k successive rounds, k-1 folds are used to fit the first-level regressors. In each round, the first-level regressors are then applied to the remaining fold that was not used for model fitting. The resulting predictions are stacked and provided as input data to the second-level regressor. After the training of the StackingCVRegressor, the first-level regressors are fit to the entire dataset for optimal prediction.
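The same out-of-fold scheme can be sketched with scikit-learn's `StackingRegressor`, which also trains its final estimator on cross-validated predictions of the base models (the notebook's `StackingCVRegressor` comes from mlxtend; the models, synthetic data, and hyperparameters below are assumptions):

```python
import numpy as np
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
# Synthetic regression problem standing in for the prepared training set
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) + rng.normal(0, 0.1, size=200)

# First-level regressors
estimators = [
    ("ridge", Ridge(alpha=1.0)),
    ("lasso", Lasso(alpha=0.001)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
]

# The meta-regressor is trained on out-of-fold predictions of the base models
stack = StackingRegressor(
    estimators=estimators,
    final_estimator=Ridge(),
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
stack.fit(X, y)
print("train R^2:", stack.score(X, y))
```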

Train models

Get cross validation scores for each model.

Blend Models and Get Training Predictions
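Blending can be sketched as a weighted average of each fitted model's predictions; the weights below are hypothetical hand-picked values, and in practice they would be tuned against the cross-validation scores:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic regression problem standing in for the prepared training set
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 1.0]) + rng.normal(0, 0.1, size=200)

models = {
    "ridge": Ridge(alpha=1.0).fit(X, y),
    "lasso": Lasso(alpha=0.001).fit(X, y),
    "rf": RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y),
}

# Hypothetical hand-picked weights; they must sum to 1
weights = {"ridge": 0.4, "lasso": 0.4, "rf": 0.2}
blended = sum(weights[name] * model.predict(X)
              for name, model in models.items())

rmse = np.sqrt(np.mean((blended - y) ** 2))
print(f"blended training RMSE: {rmse:.4f}")
```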

Identify the Best-Performing Model